VVC Decoder Analysis

1. Current Status of VVC Codecs

The VVC standard was formally finalized as an international standard in 2020. A natural question is: what is the state of VVC codec implementations in industry and academia so far?
The latest JVET-V0021 document catalogues the most recent implementation progress on the VVC standard. The summaries below are excerpted from JVET-V0021; please refer to that document for more details.

1.1 Publicly available software source code

1) JVET has developed the VVC Test Model (VTM) as its reference software encoder and decoder codebase [5]. It is intended primarily to demonstrate coding efficiency capability and proper interpretation of the syntax and decoding process specified in the standard (but not as a speed-optimized implementation), and is intended to be usable as a starting basis for product implementations. The software is available under a BSD copyright licence.

2) InterDigital developed a multi-threaded VTM decoder, and reported 6–10× speed-up relative to the single-threaded reference software [6]. It is intended to support all features of the VTM. The software was later placed in an accessible repository, and it is available under the same BSD copyright licence as the VTM software [7].

3) Fraunhofer HHI announced the VVenC encoder and VVdeC decoder open-source software (release 0.1) in September 2020 [8][9][10][11][12]. It includes support for multithreading operation, single-pass rate control, perceptual QP adaptation, and motion-compensated temporal filtering (MCTF). The software has four defined presets for quality/speed tradeoff (called “slow”, “medium”, “fast”, and “faster”). Subjective testing reported in October 2020 indicated that the VVenC encoder had about the same or better subjective compression performance as the VTM encoder when operating in its “medium” speed configuration (operating with MCTF and QP adaptation disabled in the VTM and enabled in VVenC and with rate control disabled in both) with encoding speed more than 100× that of the VTM, for 4K UHD SDR video content [12][13]. As of December 2020, a “slower” preset was added, along with an improved single-pass rate control and a new two-pass rate control [14]. The “slower” preset mode reportedly achieves approximately the same BD-rate coding efficiency as the VTM while providing a speedup of more than 8.6× for UHD and 5.2× for HD sequences relative to the VTM. As of December 2020, with release 0.2, the software is available under a BSD copyright licence. Release 0.3 of March 2021 includes substantial further speed and multithreading improvements [15].

4) Friedrich–Alexander University Erlangen–Nürnberg released an open-source bitstream analyser as an add-on for the VTM decoder [16][17]. The analyzer counts the occurrence of coding tools and coding modes used in a decoded bitstream and can be used for evaluating the decoding energy and time demands of VVC features. The software is available under a BSD copyright licence.

1.2 Software decoders

1) Sharp announced a real-time software decoder in June 2020, and issued a corresponding press release in December 2020 [18][19]. As of June 2020, it was reportedly capable of decoding 4K CTC UHD bitstreams at up to 40 Mbps at more than 60 fps.

2) Tencent announced its O266dec software decoder with SIMD and multithreading support and an associated FFmpeg/VLC-based video player in October 2020 [12][20][21]. As of December 2020, it is reportedly more than 3× the speed of the VTM reference software decoder when tested under VVC common test conditions (CTC) in single-threaded operation and about 20× the VTM decoder speed in 8-thread operation. It could reportedly decode UHD video at more than 60 fps at up to 40 Mbps and decode full HD video at more than 200 fps. In December 2020, a version with mobile platform support based on Arm Neon technology was reported. On an Apple A14 processor (iPhone 12pro) in single-threaded operation, it could reportedly decode 8-bit 1080p CTC bitstreams at more than 50 fps, and in multi-threaded operation it could decode such bitstreams at more than 100 fps and could decode 8-bit 4K UHD bitstreams at more than 30 fps in the RA configuration [22]. Although 8-bit operation was more optimized, the decoder also supports 10-bit operation.

3) Alibaba announced its Ali266 decoder for mobile devices (e.g., Android and iPhone) in December 2020 [23]. It includes optimizations for multi-threading, ARM assembly, cache efficiency, and memory usage, particularly for 8 bit video content. Real-time speed is reported for 8 bit 720p, 1080p (using 2–4 threads for up to 60 fps) and 4K (up to 7 Mbps) video content with the ALF feature disabled.

1.3 Bitstream analyser products

1) Elecard announced support for VVC in its StreamEye and StreamAnalyzer products in April 2020 [24].

2) ViCueSoft supports VVC in its VQ Analyzer bitstream analysis product, as of late 2020 [25].

1.4 Conformance test sets

1) A conformance test set is under development by JVET. It reached the CD stage of the ISO/IEC approval process in October 2020 [26].

2) Allegro DVT began offering a conformance test set for VVC as early as January 2020 (i.e., initially in preliminary form before the finalization of the standard) [27][28][29].

1.5 Encoding products and services

1) KDDI Research announced a real-time VVC encoder with 4K @60 fps capability in September 2020 [30].

2) Ateme launched support for VVC in its Titan family of products, and demonstrated the technology in an OTT channel launched in November 2020 [31].

3) Bitmovin, in partnership with Fraunhofer HHI based on VVenC as described in item 3, announced support of VVC in its video encoding platform in November 2020 [32].

2. Overview of the VVdeC Decoding Flow

VVdeC is the open-source VVC decoder developed by Fraunhofer HHI. It implements the full VVC decoding process together with further optimizations, including multithreading and x86 SIMD. The remainder of this article focuses on the multithreaded decoding design in VVdeC.

2.1 Decoding Flow Overview

The VVC standard adopts the traditional hybrid codec framework, composed of sub-algorithms such as intra prediction, inter prediction, transform coding, quantization, entropy coding, and in-loop filtering. VVC improves compression efficiency by fine-tuning each of these sub-algorithms, adopting more complex and computationally heavier techniques in exchange for higher compression.
The VVC decoding process likewise follows the traditional video decoder framework. First comes the parsing stage, in which entropy decoding recovers the basic syntax elements. Next is inverse quantization, whose output feeds the inverse transform; the inverse transform outputs the residual values. Adding these residuals to the decoded prediction values yields the reconstructed samples. Prediction decoding itself splits into intra prediction and inter prediction, which derive the prediction data from the spatial and temporal domains respectively. Finally, the reconstructed samples must pass through in-loop filtering before serving as the final decoder output and as reference input for inter prediction.
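The stage ordering described above can be sketched as a toy data flow. The helper names below (inverseQuant, inverseTransform, reconstruct) are hypothetical stand-ins that only illustrate how data moves between stages; real VVC stages operate on syntax elements and transform blocks, not plain integer vectors.

```cpp
#include <cstddef>
#include <vector>

// Dequantize: scale the parsed coefficients back up (toy model).
std::vector<int> inverseQuant( const std::vector<int>& coeffs, int qstep )
{
  std::vector<int> out;
  for( int c : coeffs )
    out.push_back( c * qstep );
  return out;
}

// Inverse transform: identity stand-in; the real stage is a 2D transform.
std::vector<int> inverseTransform( const std::vector<int>& dequant )
{
  return dequant;
}

// residual + prediction -> reconstruction (before in-loop filtering)
std::vector<int> reconstruct( const std::vector<int>& residual,
                              const std::vector<int>& prediction )
{
  std::vector<int> out( residual.size() );
  for( std::size_t i = 0; i < residual.size(); ++i )
    out[i] = residual[i] + prediction[i];
  return out;
}
```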

2.2 CTU-Granularity Multithreaded Decoding

Well-known parallelization strategies for video decoders include GOP-level, frame-level, and slice-level parallelism. These are relatively simple to implement and fairly generic, so different codecs can adopt similar mechanisms; FFmpeg, for example, wraps frame-level and slice-level parallelism in a simple framework that each codec then calls into. The drawback of these approaches is that work may be distributed unevenly among the parallel tasks, so they cannot fully exploit today's multi-core CPU architectures. VVdeC therefore implements a different parallelization mechanism: the decoding of a picture is split into several parts that execute in parallel, with each part covering one or more CTU rows. We call such a parallel unit of work a CTU task. This finer-grained parallelism makes much better use of multiple cores.
The partitioning into CTU tasks is illustrated in the figure below.

The decoding of each CTU task is further divided into the following sub-tasks.

enum TaskType
{
  /*TRAFO=-1,*/ MIDER, LF_INIT, INTER, INTRA, RSP, LF_V, LF_H, PRESAO, SAO, ALF, DONE, DMVR
};

Before each sub-task starts, the thread checks whether the sub-task's dependencies have completed. If they have, execution proceeds; if not, the task is put back into the thread pool to await the next scheduling round. This check happens at the beginning of every stage of a task's execution, so an unmet dependency at any stage causes the task to exit and be re-queued into the thread pool. The process is illustrated in the figure below.
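This retry discipline can be sketched in a few lines. The names below (Task, runUntilDone) are illustrative, not vvdec's API: a task whose dependency check fails returns to the back of the queue and is retried on a later pass.

```cpp
#include <deque>
#include <functional>

// A task pairs a dependency check with the actual work (illustrative only).
struct Task
{
  std::function<bool()> ready;   // dependency check
  std::function<void()> run;     // the actual work
};

// Drain the queue; tasks that are not ready go back to the tail to be
// retried later. Returns how many times a task had to be re-queued.
int runUntilDone( std::deque<Task>& queue )
{
  int retries = 0;
  while( !queue.empty() )
  {
    Task t = queue.front();
    queue.pop_front();
    if( !t.ready() )         // dependency not finished yet
    {
      ++retries;
      queue.push_back( t );  // back into the pool for a later pass
      continue;
    }
    t.run();
  }
  return retries;
}
```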

Let us now look at what each sub-task depends on. The picture is first divided into multiple CTU tasks, each comprising several CTUs; a CTU task is the smallest unit of work that a thread executes. Every CTU task carries a TaskType recording its current decoding state, i.e. how far its decoding has progressed. The scheduler can therefore decide whether a CTU task's dependencies are satisfied by comparing the TaskType values of the CTU tasks it depends on.
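Because the TaskType values are declared in pipeline order, "has that neighbour finished stage X?" reduces to an integer comparison on its stored state. A minimal illustration (the enum mirrors the one shown above; intraDone is a hypothetical helper, not a vvdec function):

```cpp
// TaskType values from the vvdec enum, in declaration order.
enum TaskType
{
  MIDER, LF_INIT, INTER, INTRA, RSP, LF_V, LF_H, PRESAO, SAO, ALF, DONE, DMVR
};

// A neighbour whose state is still <= INTRA has not yet finished its INTRA
// stage; a state strictly beyond INTRA means the stage has completed.
bool intraDone( TaskType neighbourState )
{
  return neighbourState > INTRA;
}
```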

  1. INTER dependencies
    The arrays thisLine and lineAbove hold the sub-task states of the current CTU-task row and the row above, respectively. The code below shows that, for all-intra slices, the current CTU task waits for the CTU task two positions to its left and the CTU task directly above to finish INTER (a wavefront optimization), and that in any case it must wait until none of the reference pictures' dependency barriers are blocked.

    if( std::all_of( cs.picture->slices.begin(), cs.picture->slices.end(), []( const Slice* pcSlice ) { return pcSlice->isIntra(); } ) )
    {
      // not really necessary, but only for optimizing the wave-fronts
      if( col > 1 && thisLine[col - 2] <= INTER )
        return false;
      if( line > 0 && lineAbove[col] <= INTER )
        return false;
    }
    if( std::any_of( cs.picture->refPicExtDepBarriers.cbegin(), cs.picture->refPicExtDepBarriers.cend(), []( const Barrier* b ) { return b->isBlocked(); } ) )
    {
      return false;
    }
  2. INTRA dependencies
    The arrays thisLine and lineAbove hold the sub-task states of the current CTU-task row and the row above, respectively. The code below shows that the current CTU task can start only after the CTU tasks to its left and above-right have finished the INTRA sub-task (the check for the CTU task directly above appears to be missing; this needs further confirmation).

    if( col > 0 && thisLine[col - 1] <= INTRA )
      return false;
    if( line > 0 && lineAbove[std::min( col + 1, widthInCtus - 1 )] <= INTRA )
      return false;
  3. LF_V dependencies
    Vertical deblocking can start only after the CTU task to the left has reached the LF_V stage.

    if( col > 0 && thisLine[col - 1] < LF_V )
      return false;
  4. LF_H dependencies
    Horizontal deblocking can start only after the CTU tasks to the right, above, and above-right have completed vertical deblocking (i.e. reached LF_H).

    if( line > 0 && lineAbove[col] < LF_H )
      return false;
    if( line > 0 && col + 1 < widthInCtus && lineAbove[col + 1] < LF_H )
      return false;
    if( col + 1 < widthInCtus && thisLine[col + 1] < LF_H )
      return false;

2.3 ThreadPool Design

To run CTU tasks in parallel, VVdeC implements a ThreadPool that handles thread scheduling and execution. On initialization, the ThreadPool creates N worker threads, where N is usually the CPU core count reported by std::thread::hardware_concurrency(). To keep the design lock-free, the pool provides a fence-like Barrier mechanism to enforce ordering between tasks: when a task is pushed into the pool, it carries a set of Barriers describing the conditions that must hold before the task may run.
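A minimal sketch of such a Barrier, assuming only the isBlocked()/unlock() usage visible in the code excerpts below (the real vvdec class also carries exception state):

```cpp
#include <atomic>

// One-shot gate that tasks poll before running: it starts blocked and is
// unlocked exactly once by the producer of the dependency.
class Barrier
{
  std::atomic<bool> m_blocked{ true };

public:
  void lock()            { m_blocked.store( true, std::memory_order_release ); }
  void unlock()          { m_blocked.store( false, std::memory_order_release ); }
  bool isBlocked() const { return m_blocked.load( std::memory_order_acquire ); }
};
```

A waiting task simply re-queues itself while isBlocked() returns true, so no mutex or condition variable is needed on the fast path.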
The definition of the ThreadPool is shown below.
Let us briefly walk through the class definition.

  1. ChunkedTaskQueue is the queue that holds tasks. Each task is stored in a Slot structure: func is the callback that performs the actual work; readyCheck is the callback invoked before execution to test whether the task's preconditions are ready; barriers are further preconditions that must hold before the task runs. In short, a task may execute only when its readyCheck callback returns true and none of its barriers are blocked.
    using TaskFunc = bool ( * )( int, void * );
    struct Slot
    {
      TaskFunc func { nullptr };
      TaskFunc readyCheck{ nullptr };
      void* param { nullptr };
      WaitCounter* counter { nullptr };
      Barrier* done { nullptr };
      CBarrierVec barriers;
      std::atomic<TaskState> state { FREE };
    };

The definition of ChunkedTaskQueue is shown below.
A Chunk holding an array of 128 Slot structures stores the tasks; a singly linked list of Chunks then chains additional tasks together.

class ChunkedTaskQueue
{
  constexpr static int ChunkSize = 128;

  class Chunk
  {
    std::array<Slot, ChunkSize> m_slots;
    std::atomic<Chunk*> m_next{ nullptr };
    Chunk& m_firstChunk;

    Chunk( Chunk* firstPtr ) : m_firstChunk{ *firstPtr } {}

    friend class ChunkedTaskQueue;
  };

private:
  Chunk m_firstChunk{ &m_firstChunk };
  Chunk* m_lastChunk = &m_firstChunk;
  std::mutex m_resizeMutex;
};
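The growth idea can be isolated into a small sketch: fixed-size chunks linked through an atomic next pointer, so readers can keep traversing while a single writer appends a new chunk under a mutex. The names below (ChunkList, grow, numChunks) are illustrative simplifications, with Slot reduced to a plain element type and a small chunk size for demonstration:

```cpp
#include <array>
#include <atomic>
#include <mutex>

template <typename T, int ChunkSize = 4>
class ChunkList
{
  struct Chunk
  {
    std::array<T, ChunkSize> slots{};
    std::atomic<Chunk*> next{ nullptr };
  };

  Chunk  m_first;                // first chunk lives inline, like in vvdec
  Chunk* m_last = &m_first;
  std::mutex m_growMutex;        // growth is serialized; traversal is not

public:
  ~ChunkList()
  {
    Chunk* c = m_first.next.load();
    while( c ) { Chunk* n = c->next.load(); delete c; c = n; }
  }

  // Append a new chunk and return a pointer to its first slot. Existing
  // slots never move, so pointers handed out earlier stay valid.
  T* grow()
  {
    std::lock_guard<std::mutex> l( m_growMutex );
    Chunk* c = new Chunk;
    m_last->next.store( c, std::memory_order_release );
    m_last = c;
    return &c->slots[0];
  }

  int numChunks() const
  {
    int n = 1;
    for( const Chunk* c = m_first.next.load(); c; c = c->next.load() )
      ++n;
    return n;
  }
};
```

Because chunks are only ever appended, the queue can grow without invalidating iterators held by worker threads, which is what lets addBarrierTask() call grow() while workers are scanning.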

The full ThreadPool interface is defined as follows.

class ThreadPool
{
  typedef enum
  {
    FREE = 0,
    PREPARING,
    WAITING,
    RUNNING
  } TaskState;

  using TaskFunc = bool ( * )( int, void * );

  struct Slot
  {
    TaskFunc func { nullptr };
    TaskFunc readyCheck{ nullptr };
    void* param { nullptr };
    WaitCounter* counter { nullptr };
    Barrier* done { nullptr };
    CBarrierVec barriers;
    std::atomic<TaskState> state { FREE };
  };

  class ChunkedTaskQueue
  {
    constexpr static int ChunkSize = 128;

    class Chunk
    {
      std::array<Slot, ChunkSize> m_slots;
      std::atomic<Chunk*> m_next{ nullptr };
      Chunk& m_firstChunk;

      Chunk( Chunk* firstPtr ) : m_firstChunk{ *firstPtr } {}

      friend class ChunkedTaskQueue;
    };

  public:
    class Iterator : public std::iterator<std::forward_iterator_tag, Slot>
    {
      Slot* m_slot = nullptr;
      Chunk* m_chunk = nullptr;

    public:
      Iterator() = default;
      Iterator( Slot* slot, Chunk* chunk ) : m_slot( slot ), m_chunk( chunk ) {}

      Iterator& operator++();
      // increment iterator and wrap around, if end is reached
      Iterator& incWrap();

      bool operator==( const Iterator& rhs ) const { return m_slot == rhs.m_slot; } // don't need to compare m_chunk, because m_slot is a pointer
      bool operator!=( const Iterator& rhs ) const { return m_slot != rhs.m_slot; } // don't need to compare m_chunk, because m_slot is a pointer

      Slot& operator*() { return *m_slot; }

      bool isValid() const { return m_slot != nullptr && m_chunk != nullptr; }
    };

    ChunkedTaskQueue() = default;
    ~ChunkedTaskQueue();
    ChunkedTaskQueue( const ChunkedTaskQueue& ) = delete;
    ChunkedTaskQueue( ChunkedTaskQueue&& ) = delete;

    // grow the queue by adding a chunk and return an iterator to the first new task-slot
    Iterator grow();

    Iterator begin() { return Iterator{ &m_firstChunk.m_slots.front(), &m_firstChunk }; }
    Iterator end()   { return Iterator{ nullptr, nullptr }; }

  private:
    Chunk m_firstChunk{ &m_firstChunk };
    Chunk* m_lastChunk = &m_firstChunk;
    std::mutex m_resizeMutex;
  };

public:
  ThreadPool( int numThreads = 1, const char *threadPoolName = nullptr );
  ~ThreadPool();

  template<class TParam>
  bool addBarrierTask( bool ( *func )( int, TParam* ),
                       TParam* param,
                       WaitCounter* counter = nullptr,
                       Barrier* done = nullptr,
                       CBarrierVec&& barriers = {},
                       bool ( *readyCheck )( int, TParam* ) = nullptr )
  {
    if( m_threads.empty() )
    {
      // in the single threaded case try to execute the task directly
      if( bypassTaskQueue( (TaskFunc)func, param, counter, done, barriers, (TaskFunc)readyCheck ) )
      {
        return true;
      }
    }
    else
    {
      checkAndThrowThreadPoolException();
    }

    while( true )
    {
#if ADD_TASK_THREAD_SAFE
      std::unique_lock<std::mutex> l(m_nextFillSlotMutex);
#endif
      CHECKD( !m_nextFillSlot.isValid(), "Next fill slot iterator should always be valid" );
      const auto startIt = m_nextFillSlot;
#if ADD_TASK_THREAD_SAFE
      l.unlock();
#endif

      bool first = true;
      for( auto it = startIt; it != startIt || first; it.incWrap() )
      {
        first = false;

        auto& t = *it;
        auto expected = FREE;
        if( t.state.load( std::memory_order_relaxed ) == FREE && t.state.compare_exchange_strong( expected, PREPARING ) )
        {
          if( counter )
          {
            counter->operator++();
          }

          t.func       = (TaskFunc)func;
          t.readyCheck = (TaskFunc)readyCheck;
          t.param      = param;
          t.done       = done;
          t.counter    = counter;
          t.barriers   = std::move( barriers );
          t.state      = WAITING;
#if ADD_TASK_THREAD_SAFE
          l.lock();
#endif
          m_nextFillSlot.incWrap();
          m_poolPause.unpauseIfPaused();
          return true;
        }
      }

#if ADD_TASK_THREAD_SAFE
      l.lock();
#endif
      m_nextFillSlot = m_tasks.grow();
    }
    return false;
  }

  bool processTasksOnMainThread();

  void checkAndThrowThreadPoolException();

  void shutdown( bool block );
  void waitForThreads();

  int numThreads() const { return (int)m_threads.size(); }

private:
  using TaskIterator = ChunkedTaskQueue::Iterator;

  struct TaskException;

  // members
  std::string              m_poolName;
  std::atomic_bool         m_exitThreads{ false };
  std::vector<std::thread> m_threads;
  ChunkedTaskQueue         m_tasks;
  TaskIterator             m_nextFillSlot = m_tasks.begin();
#if ADD_TASK_THREAD_SAFE
  std::mutex               m_nextFillSlotMutex;
#endif
  std::mutex               m_idleMutex;
  std::atomic_bool         m_exceptionFlag{ false };
  std::exception_ptr       m_threadPoolException;
  PoolPause                m_poolPause;

  // internal functions
  void         threadProc    ( int threadId );
  static bool  checkTaskReady( int threadId, CBarrierVec& barriers, TaskFunc readyCheck, void* taskParam );
  TaskIterator findNextTask  ( int threadId, TaskIterator startSearch );
  static bool  processTask   ( int threadId, Slot& task );
  bool bypassTaskQueue( TaskFunc func, void* param, WaitCounter* counter, Barrier* done, CBarrierVec& barriers, TaskFunc readyCheck );
  static void handleTaskException( const std::exception_ptr e, Barrier* done, WaitCounter* counter, std::atomic<TaskState>* slot_state );
};

The core functions are as follows.
addBarrierTask() is the interface through which external tasks are pushed into the ThreadPool for asynchronous execution.

threadProc() is the loop function of each worker thread. Once a thread is created, threadProc keeps pulling tasks from the queue (when any are available) and executing them in a while loop. When no runnable tasks remain in the queue, threadProc enters a waiting state.

findNextTask() locates a task in the queue that is allowed to run; it calls checkTaskReady() to verify that the task can be executed.

checkTaskReady() checks whether the task's Barriers and its readyCheck precondition are satisfied.

bool ThreadPool::checkTaskReady( int threadId, CBarrierVec& barriers, ThreadPool::TaskFunc readyCheck, void* taskParam )
{
  if( !barriers.empty() )
  {
    // don't break early, because isBlocked() also checks exception state
    if( std::count_if( barriers.cbegin(), barriers.cend(), []( const Barrier* b ) { return b && b->isBlocked(); } ) )
    {
      return false;
    }
  }
  // don't clear the barriers, even if they are all unlocked, because exceptions could still be signalled through them
  // barriers.clear();

  if( readyCheck && readyCheck( threadId, taskParam ) == false )
  {
    return false;
  }

  return true;
}

processTask() invokes the task's callback to perform the actual work. If the task does not run to completion (the callback returns false), the task is put back into the WAITING state so that it is picked up from the queue and executed again later.

bool ThreadPool::processTask( int threadId, ThreadPool::Slot& task )
{
  try
  {
    const bool success = task.func( threadId, task.param );
    if( !success )
    {
      task.state = WAITING;
      return false;
    }

    if( task.done != nullptr )
    {
      task.done->unlock();
    }
    if( task.counter != nullptr )
    {
      --(*task.counter);
    }
  }
  catch( ... )
  {
    throw TaskException( std::current_exception(), task );
  }
  task.state = FREE;
  return true;
}

The figure below details how the thread pool schedules tasks during CTU decoding.
DecLibRecon pushes CTU tasks, each carrying its preconditions, into the thread pool's queue.
The pool's threads run the loop function threadProc, which fetches tasks from the queue. If all of a fetched task's preconditions are satisfied, the thread executes it. When a sub-task completes during execution, the task type is updated and the thread checks whether the next sub-task can start. If it cannot, the thread stops executing the task and pushes it back into the queue; if it can, execution continues until the whole task finishes.

3. References

JVET-V0021-v1, Deployment status of the VVC standard
vvdec